
    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine (SVM), a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures for tackling imperfect data and address real problems in quality control and business analytics.
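
    In practice, cost-sensitive learning with an SVM is often realized through class-dependent misclassification penalties. The sketch below is a minimal illustration of that idea on synthetic data, assuming scikit-learn; it is not the paper's own implementation.

    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Imbalanced toy data: ~5% positives stand in for the rare class of interest.
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight="balanced" raises the penalty C for the minority class in
    # inverse proportion to its frequency, i.e. minority-class errors cost more.
    clf = SVC(kernel="rbf", C=1.0, class_weight="balanced").fit(X_tr, y_tr)

    # Per-class precision/recall/F1 are more informative than accuracy here.
    print(classification_report(y_te, clf.predict(X_te)))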

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data are invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and an expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare the classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces faster, more accurate, and more robust classification results.
    Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
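
    A flat (single-level) sketch of the imputation-plus-weighted-SVM pipeline described above, assuming scikit-learn; the paper's multilevel framework and EM-style imputation are only approximated here by scikit-learn's iterated-regression imputer.

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # X contains missing entries encoded as NaN; y is imbalanced.
    pipe = make_pipeline(
        IterativeImputer(max_iter=10, random_state=0),  # iterated regression imputation
        StandardScaler(),
        SVC(kernel="rbf", class_weight="balanced"),     # cost-sensitive (weighted) SVM
    )
    # pipe.fit(X_train, y_train); y_pred = pipe.predict(X_test)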

    Engineering Fast Multilevel Support Vector Machines

    The computational complexity of solving nonlinear support vector machines (SVMs) is prohibitive on large-scale data. The issue becomes even more acute when the data present additional difficulties such as highly imbalanced class sizes. Nonlinear kernels typically produce significantly higher classification quality than linear kernels, but they introduce extra kernel and model parameters whose fitting is computationally expensive, so the gain in quality comes with a dramatic loss in performance. We introduce a generalized fast multilevel framework for regular and weighted SVM and discuss several versions of its algorithmic components that lead to a good trade-off between quality and time. Our framework is implemented using PETSc, which allows easy integration with scientific computing tasks. The experimental results demonstrate a significant speedup compared to state-of-the-art nonlinear SVM libraries. Our source code, documentation, and parameters are available at https://github.com/esadr/mlsvm
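
    The core multilevel idea is to train on a coarse summary of the data and then refine near the resulting decision boundary. The sketch below is a two-level illustration of that scheme under assumed k-means coarsening with scikit-learn; it is not the mlsvm/PETSc implementation linked above.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def two_level_svm(X, y, n_coarse=200, refine_frac=0.1, **svm_kwargs):
        # Binary labels assumed. Coarsen each class separately so both classes
        # survive coarsening.
        reps, rep_labels = [], []
        for label in np.unique(y):
            Xc = X[y == label]
            k = min(n_coarse, len(Xc))
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
            reps.append(km.cluster_centers_)
            rep_labels.append(np.full(k, label))
        Xr, yr = np.vstack(reps), np.concatenate(rep_labels)

        # Coarse level: cheap weighted SVM on the class representatives.
        coarse = SVC(kernel="rbf", class_weight="balanced", **svm_kwargs).fit(Xr, yr)

        # Refinement: retrain only on fine points close to the coarse boundary.
        margin = np.abs(coarse.decision_function(X))
        keep = margin <= np.quantile(margin, refine_frac)
        if np.unique(y[keep]).size < 2:   # fall back if refinement drops a class
            keep = np.ones(len(y), dtype=bool)
        return SVC(kernel="rbf", class_weight="balanced", **svm_kwargs).fit(X[keep], y[keep])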

    Predictive Models for Bariatric Surgery Risks with Imbalanced Medical Datasets

    Bariatric surgery (BAR) has become a popular treatment for type 2 diabetes mellitus (T2DM), which is among the most critical obesity-related comorbidities. Patients who undergo bariatric surgery are exposed to complications after surgery. Furthermore, the mid- to long-term complications after bariatric surgery can be deadly and increase the complexity of managing the safety of these operations as well as healthcare costs. Current studies on BAR complications have mainly used risk scoring to identify patients who are more likely to have complications after surgery. However, these studies do not take into consideration the imbalanced nature of the data, where the size of the class of interest (patients who have complications after surgery) is relatively small. We propose the use of imbalanced classification techniques to tackle the imbalanced bariatric surgery data: the synthetic minority oversampling technique (SMOTE), random undersampling, and ensemble learning classification methods including Random Forest, Bagging, and AdaBoost. Moreover, we improve classification performance by using Chi-Squared, Information Gain, and Correlation-based feature selection (CFS) techniques. We study the Premier Healthcare Database with a focus on the most frequent complications, including Diabetes, Angina, Heart Failure, and Stroke. Our results show that ensemble learning-based classification techniques using any of the feature selection methods mentioned above are the best approach for handling the imbalanced nature of the bariatric surgical outcome data. In our evaluation, we find a slight preference for the SMOTE method over random undersampling. These results demonstrate the potential of machine-learning tools as clinical decision support in identifying risks and outcomes associated with bariatric surgery, and their effectiveness in reducing surgery complications as well as improving patient care.
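
    An illustrative pipeline in the spirit of this setup: chi-squared feature selection, SMOTE oversampling, and an ensemble classifier. It assumes scikit-learn and the third-party imbalanced-learn package rather than the study's own code, and the Premier Healthcare Database itself is not publicly available.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline            # resampling-aware pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_score

    # chi2 assumes non-negative features (e.g. counts or one-hot encodings).
    pipe = Pipeline([
        ("select", SelectKBest(chi2, k=20)),          # chi-squared feature selection
        ("smote", SMOTE(random_state=0)),             # oversample the minority class
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ])
    # AUC is more informative than accuracy for rare surgical complications:
    # scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")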

    A Weighted Support Vector Machine Method For Control Chart Pattern Recognition

    Manual inspection and evaluation of quality control data is a tedious task that requires the undistracted attention of specialized personnel. On the other hand, automated monitoring of a production process is necessary, not only for real-time product quality assessment, but also for potential machinery malfunction diagnosis. For this reason, control chart pattern recognition (CCPR) methods have received a lot of attention over the last two decades. Current state-of-the-art control monitoring methodology includes K charts, which are based on support vector machines (SVM). Although K charts have some profound benefits, their performance deteriorates when the learning examples for the normal class greatly outnumber those for the abnormal class. Such problems are termed imbalanced and represent the vast majority of real-life control pattern classification problems. The original SVM demonstrates poor performance when applied directly to these problems. In this paper, we propose the use of weighted support vector machines (WSVM) for automated process monitoring and early fault diagnosis. We show the benefits of WSVM over traditional SVM and compare them under various fault scenarios. We evaluate the proposed algorithm in binary and multi-class environments for the most popular abnormal quality control patterns, as well as on a real application from the wafer manufacturing industry. © 2014 Elsevier Ltd. All rights reserved.
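
    A weighted SVM for a multi-class CCPR setting amounts to giving the rare abnormal-pattern classes larger misclassification weights. The labels and weights below are illustrative placeholders, not the paper's settings, and scikit-learn is assumed.

    from sklearn.svm import SVC

    # 0 = normal pattern (abundant); 1-3 = abnormal patterns (rare), e.g. trend,
    # shift, and cyclic control chart patterns.
    weights = {0: 1.0, 1: 10.0, 2: 10.0, 3: 10.0}

    # Larger class_weight entries scale the penalty C for errors on that class;
    # scikit-learn extends this to the multi-class case internally (one-vs-one).
    wsvm = SVC(kernel="rbf", C=1.0, class_weight=weights)
    # wsvm.fit(X_train, y_train); wsvm.predict(X_test)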

    Weighted Relaxed Support Vector Machines

    Classification of imbalanced data is challenging when outliers exist. In this paper, we propose a supervised learning method to simultaneously classify imbalanced data and reduce the influence of outliers. The proposed method is a cost-sensitive extension of relaxed support vector machines (RSVM), where the restricted penalty-free slack is split independently between the two classes, in proportion to the number of samples in each class and with different weights, hence the name weighted relaxed support vector machines (WRSVM). We compare the classification results of WRSVM with SVM, WSVM, and RSVM on public benchmark datasets with imbalanced classes and outlier noise, and show that WRSVM produces more accurate and robust classification results.
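
    As a rough sketch of the kind of model described, based only on the abstract (the exact formulation is given in the paper): a soft-margin SVM with class-dependent costs C_+ and C_- plus a penalty-free slack eta_i whose total budget is capped separately for each class.

    \begin{aligned}
    \min_{w,\,b,\,\xi,\,\eta}\quad & \tfrac{1}{2}\lVert w\rVert^2
        + C_{+}\sum_{i \in I_{+}} \xi_i + C_{-}\sum_{i \in I_{-}} \xi_i \\
    \text{s.t.}\quad & y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i - \eta_i,
        \qquad \xi_i \ge 0,\ \eta_i \ge 0, \\
    & \sum_{i \in I_{+}} \eta_i \le K_{+},
        \qquad \sum_{i \in I_{-}} \eta_i \le K_{-},
    \end{aligned}

    where I_+ and I_- index the two classes and the free-slack budgets K_+ and K_- are taken proportional to the class sizes; the specific weights and budgets are assumptions made here for illustration.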

    Constraint Relaxation, Cost-Sensitive Learning And Bagging For Imbalanced Classification Problems With Outliers

    Supervised learning consists of developing models that can distinguish data belonging to different categories (classes). When the classes are present in very different proportions, the problem becomes imbalanced and the performance of standard classification methods deteriorates significantly. Imbalanced classification becomes even more challenging in the presence of outliers. In this paper, we study several algorithmic modifications of the support vector machine classifier for tackling imbalanced problems with outliers. We provide computational evidence that the combined use of cost-sensitive learning and constraint relaxation performs better, on average, than algorithmic tweaks that involve bagging, a popular approach for dealing with imbalanced problems or outliers separately. The proposed technique is embedded and requires the solution of a single convex optimization problem, with no outlier-detection preprocessing.
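
    A minimal sketch of the two families being compared above: a cost-sensitive SVM versus a bagging ensemble of SVMs. The constraint-relaxation (penalty-free slack) component of the proposed method is omitted here, and scikit-learn is assumed.

    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Cost-sensitive baseline: minority-class errors are penalized more heavily.
    cost_sensitive_svm = SVC(kernel="rbf", class_weight="balanced")

    # Bagging baseline: many SVMs trained on bootstrap subsamples, then voted.
    bagged_svm = BaggingClassifier(SVC(kernel="rbf"), n_estimators=25,
                                   max_samples=0.8, random_state=0)

    # Balanced accuracy (or minority-class F1) is more telling than plain accuracy:
    # for model in (cost_sensitive_svm, bagged_svm):
    #     print(cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean())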